### Creating Diverse Synthetic Datasets

In the last chapter, we explored how large language models (LLMs) can be used to create synthetic datasets to improve a local Retriever model. This approach leverages a vast collection of unlabeled documents, where each document can generate one or more synthetic queries, creating pairs of queries and documents.

But what if your project doesn’t involve information retrieval? For instance, if you are classifying legal documents but can’t send data to an external service, you will need to train a local model. The challenge here is that gathering data can become a major hurdle, potentially delaying your product development.

Let’s simplify this with an example: suppose we want to generate children’s stories. This was the focus of research by Eldan and Li (2023). Each story consists of 2-3 paragraphs with a simple plot and theme, and the whole dataset stays within a child’s vocabulary and understanding.

Language is more than just rules and symbols; it conveys meaning. One of the biggest challenges when using LLMs to create training data is ensuring the dataset is diverse. Even when we adjust the creativity of the model (known as the generation temperature), the stories can end up being repetitive and lack the variety we need (even for children's language). Other challenges include maintaining coherence and relevance in the generated content.

To improve diversity, the authors created a vocabulary list of around 1,500 basic words that represent a typical child's vocabulary, categorized into nouns, verbs, and adjectives. During each story generation, the model randomly picks one verb, one noun, and one adjective, which it then incorporates into the story.

This method greatly increased the variety of the dataset, making sure the stories included different vocabulary and combinations of concepts. Additionally, the authors included possible story elements, such as dialogue, plot twists, bad endings, or moral lessons, and randomly selected a few for each story.

**Prompt Example:**

Write a short story (3-5 paragraphs) using simple words that a 3-year-old child would understand. The story should include the verb “{random.choice(verbs_list)}”, the noun “{random.choice(nouns_list)}”, and the adjective “{random.choice(adjectives_list)}”. The story should feature: {random.choice(features_list)}, {random.choice(features_list)}. Remember to use only simple words!

For example, the prompt could look like this:

**Prompt:**

Write a short story (3-5 paragraphs) using simple words that a 3-year-old child would understand. The story should include the verb “decorate”, the noun “thunder”, and the adjective “ancient”. The story should have: at least one dialogue, and a bad ending. Remember to use only simple words!

**Output:**

Once upon a time, in an ancient house, there lived a girl named Lily. She loved to decorate her room with pretty things. One day, she found a big box in the attic. She opened it and saw many shiny decorations. Lily was very happy and decided to use them in her room. But outside, the sky turned dark and there was a loud thunder. Lily got scared and ran to her mom. “Mommy, the thunder is so loud!” her mom comforted her, saying, “Don’t worry, it will pass soon.” But the thunder got louder, shaking the ancient house, and suddenly, the roof fell down on her decorations. Lily was sad because her room was not pretty anymore. The end.

Using advanced text generation models like GPT-3.5 and GPT-4, we can create a large amount of synthetic content based on these prompts. By varying the words in the prompt each time, we ensure that the stories generated are quite different, even if some words remain the same. Essentially, we inject randomness into the prompts to create a diverse dataset.
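The randomized prompt can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the word lists here are tiny hypothetical stand-ins for the roughly 1,500-word vocabulary, and `random.sample` is used for the features (instead of two independent `random.choice` calls, as in the template) so that the same feature is never picked twice.

```python
import random

# Hypothetical, tiny word lists; the real vocabulary has about 1,500 words.
VERBS = ["decorate", "jump", "share"]
NOUNS = ["thunder", "garden", "kite"]
ADJECTIVES = ["ancient", "shiny", "brave"]
FEATURES = ["at least one dialogue", "a plot twist", "a bad ending", "a moral lesson"]

TEMPLATE = (
    "Write a short story (3-5 paragraphs) using simple words that a "
    "3-year-old child would understand. The story should include the verb "
    '"{verb}", the noun "{noun}", and the adjective "{adjective}". '
    "The story should feature: {features}. Remember to use only simple words!"
)

def build_prompt(rng):
    """Fill the template with one random verb, noun, adjective and two features."""
    # Sampling without replacement avoids requesting the same feature twice.
    features = rng.sample(FEATURES, k=2)
    return TEMPLATE.format(
        verb=rng.choice(VERBS),
        noun=rng.choice(NOUNS),
        adjective=rng.choice(ADJECTIVES),
        features=", ".join(features),
    )

rng = random.Random(0)  # fixed seed so the run is reproducible
prompts = [build_prompt(rng) for _ in range(5)]
for p in prompts:
    print(p)
```

Each prompt would then be sent to the LLM of your choice; that call is deliberately left out here because it depends on the provider.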

Here’s a straightforward approach to generating your synthetic dataset:

1. Identify the parameters or entities that might change between different examples in your dataset.
2. Collect or create a list of these entities to fill in the gaps.
3. Generate the dataset by randomly selecting entities to use in the prompts. It’s beneficial to set the generation temperature higher than the default but below the maximum.
4. Train a local model using the generated results from ChatGPT or GPT-4.

It’s worth mentioning that one of the entities used in your prompts could serve as a class label. For example, in a sentiment classification task, you could include “positive” or “negative” in the prompt to receive a generated text that aligns with that label. Then, you can train a local classifier on this labeled data.
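As a rough sketch of this idea, the label can be sampled first and written into the prompt, so every generated text arrives already labeled. The topic list, the prompt wording, and the `fake_generate` function below are all assumptions made for illustration; in practice `generate` would wrap a real LLM call.

```python
import random

LABELS = ["positive", "negative"]
TOPICS = ["a restaurant visit", "a phone purchase", "a hotel stay"]  # hypothetical

def make_labeled_prompt(rng):
    """Return (prompt, label); the label is sampled first and written into the prompt."""
    label = rng.choice(LABELS)
    topic = rng.choice(TOPICS)
    prompt = f"Write a short {label} customer review about {topic}."
    return prompt, label

def build_dataset(generate, n, seed=0):
    """Call generate(prompt) -> str n times and pair each text with its label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        prompt, label = make_labeled_prompt(rng)
        rows.append({"text": generate(prompt), "label": label})
    return rows

def fake_generate(prompt):
    """Placeholder for a real LLM call; it only echoes the prompt."""
    return f"[generated text for: {prompt}]"

dataset = build_dataset(fake_generate, n=4)
print(dataset[0])
```

The resulting list of `text`/`label` rows is what a local classifier would be trained on.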

### Iterative Synthetic Data Generation

You can extend the process by using more entities, some of which are generated by the LLM itself. For instance, you might first prompt the LLM to write a story summary and a single sentence that must appear in the story. You would then include that generated summary and sentence in the final request.

**Prompt Example:**

**Summary:** {a short summary generated by LLM}  
**Features:** {copy the features from the initial prompt}  
**Sentence:** {a sentence generated by LLM that should be included in the story}  
**Words:** {copy the words from the initial prompt}  
**Story:**

**Filled-In Example:**

**Prompt:**

**Summary:** Lily and Timmy build a sandcastle together and learn to compromise, but it gets knocked over by a gust of wind. They find beauty in the broken sandcastle and play happily with a butterfly.  
**Features:** Dialogue, Foreshadowing, Twist  
**Sentence:** One day, she went to the park and saw a beautiful butterfly.  
**Words:** disagree, network, beautiful  
**Story:**

**Output:**

Once upon a time, there was a little girl named Lily. She loved to play with her toys and friends. One day, she went to the park and saw a beautiful butterfly. She tried to catch it, but it flew away. Lily’s friend, Timmy, came to play with her. They decided to build a sandcastle, but they disagreed on how to build it. Lily wanted it tall, while Timmy wanted it wide. After some arguing, they agreed to make it both tall and wide. While they built, they saw ants working together and realized they were building a network of tunnels. Suddenly, a gust of wind knocked over their sandcastle. They felt sad, but then they saw the butterfly again, which landed on the broken sandcastle, showing them its beauty even in pieces. They smiled and played happily together.
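The two-stage flow can be sketched as follows. This is one possible arrangement under stated assumptions: the stage-one prompt wording is invented for illustration, and `fake_generate` is a stand-in for a real LLM call.

```python
def build_final_prompt(summary, features, sentence, words):
    """Assemble the second-stage prompt from generated and sampled parts."""
    return (
        f"Summary: {summary}\n"
        f"Features: {', '.join(features)}\n"
        f"Sentence: {sentence}\n"
        f"Words: {', '.join(words)}\n"
        "Story:"
    )

def two_stage_story(generate, features, words):
    """Stage 1 asks for a summary and a sentence; stage 2 asks for the story."""
    summary = generate(
        "Write a short summary of a children's story that uses the words "
        f"{', '.join(words)} and has these features: {', '.join(features)}."
    )
    sentence = generate(
        f"Write one simple sentence that could appear in this story: {summary}"
    )
    final_prompt = build_final_prompt(summary, features, sentence, words)
    return final_prompt, generate(final_prompt)

def fake_generate(prompt):
    """Placeholder for a real LLM call."""
    return f"<LLM output for a prompt of {len(prompt)} characters>"

final_prompt, story = two_stage_story(
    fake_generate,
    features=["Dialogue", "Foreshadowing", "Twist"],
    words=["disagree", "network", "beautiful"],
)
print(final_prompt)
```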

This method allows us to generate hundreds of thousands of diverse examples to train our model. For instance, if you’re creating a classifier to determine whether a text contains dialogue or a plot twist, the features requested in the initial prompt serve as the target labels for each generated example.
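A minimal sketch of turning requested features into classifier targets is shown below. The feature names are hypothetical, and the labels should be treated as noisy, since the LLM may not honor every requested feature.

```python
# Hypothetical feature vocabulary; the order fixes the position of each target.
FEATURES = ["Dialogue", "Foreshadowing", "Twist", "BadEnding", "MoralValue"]

def features_to_targets(selected):
    """Multi-hot vector: 1 where the feature was requested in the prompt."""
    chosen = set(selected)
    return [1 if feature in chosen else 0 for feature in FEATURES]

example = {"features": ["Dialogue", "Foreshadowing", "Twist"]}
targets = features_to_targets(example["features"])
print(dict(zip(FEATURES, targets)))
```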

### The Value of Synthetic Datasets

A key question is whether synthetic datasets can effectively train models for real-world applications. The authors of the children’s story study explored this and found that training smaller language models on synthetic data from advanced LLMs is beneficial.

Gunasekar et al. (2023) emphasize the importance of high-quality training data. They argue that language models perform better when trained on materials that resemble well-crafted textbooks: clear, comprehensive, informative, and unbiased.

These principles guided the creation of a semi-synthetic dataset used to train an LLM called phi-1. The model's main task is to generate Python functions from given descriptions, and its performance was evaluated on the HumanEval benchmark (Chen et al., 2021).

The authors stress the need for diversity in this approach. A diverse dataset exposes the model to various coding expressions and problem-solving methods, reducing the chance of overfitting to specific patterns, and enhancing its ability to tackle new or complex tasks.

To address the coding challenge, the authors created textbook-like documents focused on topics that promote reasoning and algorithmic skills. They ensured diversity by varying:

- Topics
- Target audience

Unfortunately, specific details about the prompt templates used to generate synthetic data were not disclosed, but the results were impressive. They opted to use ChatGPT (GPT-3.5) instead of GPT-4, yet still achieved excellent outcomes by fine-tuning the model with synthetic data.
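Because the actual templates are not public, the following is only a hypothetical illustration of how topic and audience could be varied. The topic list, audience list, and template wording are all invented here and should not be read as the authors' prompts.

```python
import itertools
import random

# Hypothetical values; the paper does not disclose the real lists.
TOPICS = ["recursion", "sorting algorithms", "matrix operations"]
AUDIENCES = [
    "high-school students",
    "first-year undergraduates",
    "self-taught programmers",
]

TEMPLATE = (
    "Write a textbook section about {topic} for {audience}. "
    "Explain the idea clearly and include a short Python example."
)

def textbook_prompts(n, seed=0):
    """Return n distinct prompts drawn from all topic/audience combinations."""
    rng = random.Random(seed)
    combos = list(itertools.product(TOPICS, AUDIENCES))
    rng.shuffle(combos)
    return [TEMPLATE.format(topic=t, audience=a) for t, a in combos[:n]]

prompts = textbook_prompts(4)
for p in prompts:
    print(p)
```

Crossing even short lists grows the number of distinct prompts quickly, which is one simple way to push a generator toward varied output.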

**Sample Textbook-Like Output:**

**Singular Matrices**  
In linear algebra, a matrix is called "singular" if it does not have an inverse. This can also be defined as a matrix with a determinant of zero. Singular matrices are important because they represent systems of linear equations that lack a unique solution.  
Consider a 2x2 matrix `[[a, b], [c, d]]`. Its determinant is `ad - bc`, so if `ad` equals `bc`, the matrix is singular. For example, for the matrix `[[2, 4], [1, 2]]`, the products of the diagonals are equal (2 * 2 = 4 * 1), making it a singular matrix.

```python
import numpy as np

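def is_singular(matrix):
    """Return True if the square matrix has a determinant of (numerically) zero."""
    determinant = np.linalg.det(matrix)
    # Compare against a small tolerance because of floating-point rounding.
    return abs(determinant) < 1e-9

matrix_example = np.array([[2, 4], [1, 2]])
print(is_singular(matrix_example))  # prints True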
```

The authors generated about 1 billion tokens of synthetic text to augment the model's training set, enabling a small model (1.3 billion parameters) to perform comparably to much larger models (for more details, refer to Gunasekar et al. (2023)).

### Conclusion

While you might not need such an extensive synthetic dataset for your work (the authors focused on pretraining, which demands significant resources), even a smaller amount can be valuable. For example, generating tokens at a cost of $0.002 per 1,000 tokens (the standard ChatGPT rate) would total about $2 for a million tokens, making it a cost-effective solution.
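The arithmetic behind that estimate can be checked with a short helper. The default price is the figure quoted above; actual rates vary by model and change over time, so treat it as a parameter rather than a fact about current pricing.

```python
def generation_cost(num_tokens, price_per_1k=0.002):
    """Estimated cost in dollars for generating num_tokens tokens."""
    return num_tokens / 1000 * price_per_1k

for tokens in (1_000_000, 1_000_000_000):
    print(f"{tokens:>13,} tokens: ${generation_cost(tokens):,.2f}")
```

At this rate, a dataset on the scale of 1 billion tokens would come to roughly $2,000.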